54 research outputs found
Focus Is All You Need: Loss Functions For Event-based Vision
Event cameras are novel vision sensors that output pixel-level brightness
changes ("events") instead of traditional video frames. These asynchronous
sensors offer several advantages over traditional cameras, such as high
temporal resolution, very high dynamic range, and no motion blur. To unlock the
potential of such sensors, motion compensation methods have been recently
proposed. We present a collection and taxonomy of twenty-two objective
functions to analyze event alignment in motion compensation approaches (Fig.
1). We call them Focus Loss Functions since they have strong connections with
functions used in traditional shape-from-focus applications. The proposed loss
functions make it possible to bring mature computer vision tools to the realm of event
cameras. We compare the accuracy and runtime performance of all loss functions
on a publicly available dataset, and conclude that the variance, the gradient
and the Laplacian magnitudes are among the best loss functions. The
applicability of the loss functions is shown on multiple tasks: rotational
motion, depth, and optical flow estimation. The proposed focus loss functions
thus help unlock the outstanding properties of event cameras.
Comment: 29 pages, 19 figures, 4 tables
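The variance objective singled out above can be sketched in a few lines: warp events to a reference time under a candidate motion, accumulate them into an image of warped events, and score sharpness as the image variance. The 1-D constant-velocity warp below is a simplifying assumption for illustration, not the paper's full rotational model.

```python
import numpy as np

def variance_focus(xs, ys, ts, vx, shape):
    """Variance focus loss: warp events to the first timestamp
    under a candidate horizontal velocity vx (px/s), accumulate
    them into an image of warped events, and return its variance.
    Better-aligned (sharper) images score higher."""
    H, W = shape
    xw = np.round(xs - vx * (ts - ts[0])).astype(int)
    yw = np.round(ys).astype(int)
    ok = (xw >= 0) & (xw < W) & (yw >= 0) & (yw < H)
    iwe = np.zeros(shape)
    np.add.at(iwe, (yw[ok], xw[ok]), 1.0)  # event count per pixel
    return iwe.var()
```

Maximizing this score over candidate motions (e.g. a simple grid search over vx) recovers the motion that best aligns the events.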
MicroPoem: experimental investigation of birch pollen emissions
Diseases due to aeroallergens have steadily increased over the last decades and affect more and more people. Adequate protective and pre-emptive measures require both a reliable assessment of the production and release of various pollen species and the forecasting of their atmospheric dispersion. Pollen forecast models, which may be based either on statistical knowledge or on full physical transport and dispersion modeling, can provide pollen forecasts with full spatial coverage. Such models are currently being developed in many countries. The most important shortcoming of these pollen transport systems is the description of emissions, namely the dependence of the emission rate on physical processes such as turbulent exchange or mean transport and on biological processes such as ripening (temperature) and preparedness for release. The quantification of pollen emissions and the determination of the governing mesoscale and micrometeorological factors are therefore the subject of the present project MicroPoem, which includes experimental field work as well as numerical modeling. The overall goal of the project is to derive an emission parameterization based on meteorological parameters, eventually leading to enhanced pollen forecasts. In order to have a well-defined source location, an isolated birch stand was chosen for the set-up of a 'natural tracer experiment', which was conducted during the birch pollen season in spring 2009. The site was located in a broad valley, where a mountain-plains wind system usually becomes effective during clear weather periods. This allowed us to presume a rather persistent wind direction and considerable wind speed during both day- and nighttime. Several micrometeorological towers were operated up- and downwind of this reference source, and an array of 26 pollen traps was laid out to observe the spatio-temporal variability of pollen concentrations.
Additionally, the lower boundary layer was probed by means of a sodar and a tethered balloon system (the latter also yielding a pollen concentration profile). In the present contribution, a project overview is given and first results are presented. Emphasis is put on the relative performance of different sampling technologies and the corresponding relative calibration in the lab and in the field. The concentration distribution downwind of the birch stand exhibits significant spatial (and temporal) variability. Small-scale numerical dispersion modeling will be used to infer the emission characteristics that optimally explain the observed concentration patterns.
E-RAFT: Dense Optical Flow from Event Cameras
We propose to incorporate feature correlation and sequential processing into dense optical flow estimation from event cameras. Modern frame-based optical flow methods rely heavily on matching costs computed from feature correlation. In contrast, there exists no optical flow method for event cameras that explicitly computes matching costs. Instead, learning-based approaches using events usually resort to the U-Net architecture to estimate optical flow sparsely. Our key finding is that the introduction of correlation features significantly improves results compared to previous methods that rely solely on convolution layers. Compared to the state of the art, our proposed approach computes dense optical flow and reduces the end-point error by 23% on MVSEC. Furthermore, we show that all existing optical flow methods developed so far for event cameras have been evaluated on datasets with very small displacement fields, with a maximum flow magnitude of 10 pixels. Based on this observation, we introduce a new real-world dataset that exhibits displacement fields with magnitudes of up to 210 pixels and a 3 times higher camera resolution. Our proposed approach reduces the end-point error on this dataset by 66%.
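The matching costs from feature correlation that E-RAFT brings over from frame-based methods can be sketched as an all-pairs cost volume. The dot-product form below mirrors RAFT-style volumes; the feature maps and scaling are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def correlation_volume(f1, f2):
    """All-pairs matching costs between two (C, H, W) feature
    maps: entry [i, j, k, l] is the scaled dot product between
    the feature at (i, j) in f1 and the feature at (k, l) in f2."""
    C, H, W = f1.shape
    a = f1.reshape(C, H * W)
    b = f2.reshape(C, H * W)
    corr = (a.T @ b) / np.sqrt(C)   # (H*W, H*W) matching costs
    return corr.reshape(H, W, H, W)
```

A flow estimator then looks up this volume around the current flow estimate and refines it iteratively.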
DSEC: A Stereo Event Camera Dataset for Driving Scenarios
Once an academic venture, autonomous driving has received unparalleled
corporate funding in the last decade. Still, the operating conditions of
current autonomous cars are mostly restricted to ideal scenarios. This means
that driving in challenging illumination conditions such as night, sunrise, and
sunset remains an open problem. In these cases, standard cameras are being
pushed to their limits in terms of low light and high dynamic range
performance. To address these challenges, we propose DSEC, a new dataset that
contains such demanding illumination conditions and provides a rich set of
sensory data. DSEC offers data from a wide-baseline stereo setup of two color
frame cameras and two high-resolution monochrome event cameras. In addition, we
collect lidar data and RTK GPS measurements, both hardware synchronized with
all camera data. One of the distinctive features of this dataset is the
inclusion of high-resolution event cameras. Event cameras have received
increasing attention for their high temporal resolution and high dynamic range
performance. However, due to their novelty, event camera datasets in driving
scenarios are rare. This work presents the first high-resolution, large-scale
stereo dataset with event cameras. The dataset contains 53 sequences collected
by driving in a variety of illumination conditions and provides ground truth
disparity for the development and evaluation of event-based stereo algorithms.
Comment: IEEE Robotics and Automation Letters
Bridging the Gap Between Events and Frames Through Unsupervised Domain Adaptation
Reliable perception during fast motion maneuvers or in high dynamic range environments is crucial for robotic systems. Since event cameras are robust to these challenging conditions, they have great potential to increase the reliability of robot vision. However, event-based vision has been held back by the shortage of labeled datasets due to the novelty of event cameras. To overcome this drawback, we propose a task transfer method to train models directly with labeled images and unlabeled event data. Compared to previous approaches, (i) our method transfers from single images to events instead of high frame rate videos, and (ii) it does not rely on paired sensor data. To achieve this, we leverage the generative event model to split event features into content and motion features. This split enables efficient matching between latent spaces for events and images, which is crucial for successful task transfer. Thus, our approach unlocks the vast amount of existing image datasets for the training of event-based neural networks. Our task transfer method consistently outperforms methods targeting Unsupervised Domain Adaptation, for object detection by 0.26 mAP (an increase of 93%) and for classification by 2.7% accuracy.
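The latent-space matching that this split enables can be illustrated with a toy alignment objective. The moment-matching loss below is a hypothetical stand-in, not the paper's actual objective; it only shows what aligning unpaired batches of embeddings can look like.

```python
import numpy as np

def latent_alignment_loss(z_img, z_evt):
    """Toy alignment loss between unpaired batches of image and
    event embeddings, each of shape (N, D): penalize the squared
    distance between their first two per-dimension moments."""
    mean_gap = np.sum((z_img.mean(axis=0) - z_evt.mean(axis=0)) ** 2)
    var_gap = np.sum((z_img.var(axis=0) - z_evt.var(axis=0)) ** 2)
    return mean_gap + var_gap
```

Because the loss compares batch statistics rather than individual samples, it needs no paired image/event data, which is the key property exploited by unsupervised domain adaptation.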
From Chaos Comes Order: Ordering Event Representations for Object Recognition and Detection
Today, state-of-the-art deep neural networks that process events first
convert them into dense, grid-like input representations before using an
off-the-shelf network. However, selecting the appropriate representation for
the task traditionally requires training a neural network for each
representation and selecting the best one based on the validation score, which
is very time-consuming. This work eliminates this bottleneck by selecting
representations based on the Gromov-Wasserstein Discrepancy (GWD) between raw
events and their representation. It is about 200 times faster to compute than
training a neural network and preserves the task performance ranking of event
representations across multiple representations, network backbones, datasets,
and tasks. Thus, finding representations with high task scores is equivalent to
finding representations with a low GWD. We use this insight to, for the first
time, perform a hyperparameter search on a large family of event
representations, revealing new and powerful representations that exceed the
state-of-the-art. Our optimized representations outperform existing
representations by 1.7 mAP on the 1 Mpx dataset and 0.3 mAP on the Gen1
dataset, two established object detection benchmarks, and reach a 3.8% higher
classification score on the mini N-ImageNet benchmark. Moreover, we outperform
the state-of-the-art by 2.1 mAP on Gen1 and state-of-the-art feed-forward
methods by 6.0 mAP on the 1 Mpx dataset. This work opens a new, unexplored
field of explicit representation optimization for event-based learning.
Comment: 15 pages, 11 figures, 2 tables, ICCV 2023 camera-ready paper
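The "dense, grid-like input representations" being ranked can be made concrete with one common member of the family, the voxel grid; its temporal-bin count is exactly the kind of hyperparameter a GWD-guided search could tune. The bilinear temporal kernel below is a standard choice, assumed here for illustration.

```python
import numpy as np

def voxel_grid(xs, ys, ts, ps, bins, shape):
    """Convert events (pixel coords xs/ys, timestamps ts,
    polarities ps) into a (bins, H, W) voxel grid, distributing
    each event's polarity bilinearly over its two nearest
    temporal bins."""
    H, W = shape
    grid = np.zeros((bins, H, W))
    span = max(ts.max() - ts.min(), 1e-9)
    tn = (ts - ts.min()) / span * (bins - 1)   # bin-space timestamps
    t0 = np.floor(tn).astype(int)
    for d in (0, 1):                            # two nearest bins
        b = t0 + d
        w = 1.0 - np.abs(tn - b)                # bilinear weight
        m = (b < bins) & (w > 0)
        np.add.at(grid, (b[m], ys[m].astype(int), xs[m].astype(int)),
                  ps[m] * w[m])
    return grid
```

The point of the paper is that variants of such grids (bin counts, kernels, polarity handling) can be ranked by GWD without training a network for each one.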
Recurrent Vision Transformers for Object Detection with Event Cameras
We present Recurrent Vision Transformers (RVTs), a novel backbone for object
detection with event cameras. Event cameras provide visual information with
sub-millisecond latency at a high-dynamic range and with strong robustness
against motion blur. These unique properties offer great potential for
low-latency object detection and tracking in time-critical scenarios. Prior
work in event-based vision has achieved outstanding detection performance but
at the cost of substantial inference time, typically beyond 40 milliseconds. By
revisiting the high-level design of recurrent vision backbones, we reduce
inference time by a factor of 5 while retaining similar performance. To achieve
this, we explore a multi-stage design that utilizes three key concepts in each
stage: First, a convolutional prior that can be regarded as a conditional
positional embedding. Second, local- and dilated global self-attention for
spatial feature interaction. Third, recurrent temporal feature aggregation to
minimize latency while retaining temporal information. RVTs can be trained from
scratch to reach state-of-the-art performance on event-based object detection -
achieving an mAP of 47.5% on the Gen1 automotive dataset. At the same time,
RVTs offer fast inference (13 ms on a T4 GPU) and favorable parameter
efficiency (5 times fewer than prior art). Our study brings new insights into
effective design choices that could be fruitful for research beyond event-based
vision.
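The third concept, recurrent temporal feature aggregation, can be sketched as a gated running state over per-timestep feature maps. The scalar-gated blend below is a deliberately minimal stand-in for the paper's recurrent cell, an assumption for illustration rather than the actual RVT block.

```python
import numpy as np

def aggregate_temporal(features, w_in=0.0, w_state=0.0):
    """Fold a sequence of per-step feature maps into one state:
    at each step a sigmoid gate decides how much new evidence to
    mix into the running memory. Scalar gate weights keep the
    sketch minimal; a real cell would learn per-channel weights."""
    h = np.zeros_like(features[0])
    for f in features:
        z = 1.0 / (1.0 + np.exp(-(w_in * f + w_state * h)))  # gate
        h = z * f + (1.0 - z) * h                            # blend
    return h
```

Because each step only updates a fixed-size state, latency per frame stays constant while temporal context is retained, which is the design goal the abstract describes.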